 Framingham


Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference

arXiv.org Machine Learning

Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
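The idea of embedding the selection mechanism in the generative simulator can be illustrated with a toy example. The sketch below is not the paper's framework: it uses no neural posterior estimation, and the function names, the binary-outcome model, and the selection probabilities are all assumptions made for illustration. It only shows how outcome-dependent selection distorts a naive prevalence estimate, and how the same simulator can reproduce that distortion.

```python
import random


def simulate_population(prevalence, n, rng):
    """Draw n binary outcomes (1 = outcome present) with the given prevalence."""
    return [1 if rng.random() < prevalence else 0 for _ in range(n)]


def apply_selection(outcomes, p_select_pos, p_select_neg, rng):
    """Keep each unit with a probability that depends on its own outcome.

    This is the hypothetical selection mechanism: units with the outcome
    enter the dataset with probability p_select_pos, the others with
    probability p_select_neg.
    """
    kept = []
    for y in outcomes:
        p = p_select_pos if y == 1 else p_select_neg
        if rng.random() < p:
            kept.append(y)
    return kept


def simulate_observed(prevalence, n, p_select_pos, p_select_neg, seed):
    """Simulator with selection built in: population draw, then selection."""
    rng = random.Random(seed)
    population = simulate_population(prevalence, n, rng)
    return apply_selection(population, p_select_pos, p_select_neg, rng)


# True prevalence is 0.2, but positives are three times as likely to be selected.
observed = simulate_observed(0.2, 20000, p_select_pos=0.9, p_select_neg=0.3, seed=0)
naive_estimate = sum(observed) / len(observed)

# Without differential selection, the naive estimate stays close to 0.2.
unbiased = simulate_observed(0.2, 20000, p_select_pos=0.5, p_select_neg=0.5, seed=0)
unbiased_estimate = sum(unbiased) / len(unbiased)

print(f"naive estimate under selection: {naive_estimate:.3f}")
print(f"estimate without selection bias: {unbiased_estimate:.3f}")
```

In expectation the naive estimate under selection is 0.18 / (0.18 + 0.24) ≈ 0.43 rather than 0.2. A simulation-based method trained on draws from `simulate_observed` would see data distorted in the same way as the observed sample, which is the intuition behind treating selection as part of the simulation process.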


Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems

arXiv.org Artificial Intelligence

The emergence of AI-augmented Data Processing Systems (DPSs) has introduced powerful semantic operators that extend traditional data management capabilities with LLM-based processing. However, these systems face fundamental reliability (a.k.a. trust) challenges, as LLMs can generate erroneous outputs, limiting their adoption in critical domains. Existing approaches to LLM constraints--ranging from user-defined functions to constrained decoding--are fragmented, imperative, and lack semantics-aware integration into query execution. To address this gap, we introduce Semantic Integrity Constraints (SICs), a novel declarative abstraction that extends traditional database integrity constraints to govern and optimize semantic operators within DPSs. SICs integrate seamlessly into the relational model, allowing users to specify common classes of constraints (e.g., grounding and soundness) while enabling query-aware enforcement and optimization strategies. In this paper, we present the core design of SICs, describe their formal integration into query execution, and detail our conception of grounding constraints, a key SIC class that ensures factual consistency of generated outputs. In addition, we explore novel enforcement mechanisms, combining proactive (constrained decoding) and reactive (validation and recovery) techniques to optimize efficiency and reliability. Our work establishes SICs as a foundational framework for trustworthy, high-performance AI-augmented data processing, paving the way for future research in constraint-driven optimizations, adaptive enforcement, and enterprise-scale deployments.


Petal-X: Human-Centered Visual Explanations to Improve Cardiovascular Risk Communication

arXiv.org Artificial Intelligence

Cardiovascular diseases (CVDs), the leading cause of death worldwide, can be prevented in most cases through behavioral interventions. Therefore, effective communication of CVD risk and projected risk reduction by risk factor modification plays a crucial role in reducing CVD risk at the individual level. However, despite interest in refining risk estimation with improved prediction models such as SCORE2, the guidelines for presenting these risk estimations in clinical practice have remained essentially unchanged in the last few years, with graphical score charts (GSCs) continuing to be one of the prevalent systems. This work describes the design and implementation of Petal-X, a novel tool to support clinician-patient shared decision-making by explaining the CVD risk contributions of different factors and facilitating what-if analysis. Petal-X relies on a novel visualization, Petal Product Plots, and a tailor-made global surrogate model of SCORE2, whose fidelity is comparable to that of the GSCs used in clinical practice. We evaluated Petal-X compared to GSCs in a controlled experiment with 88 healthcare students, all but one with experience with chronic patients. The results show that Petal-X outperforms GSCs in critical tasks, such as comparing the contribution to the patient's 10-year CVD risk of each modifiable risk factor, without a significant loss of perceived transparency, trust, or intent to use. Our study provides an innovative approach to the visualization and explanation of risk in clinical practice that, due to its model-agnostic nature, could continue to support next-generation artificial intelligence risk assessment models.


Comparison of Machine Learning Classification Algorithms and Application to the Framingham Heart Study

arXiv.org Machine Learning

The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While the exacerbation of biases can occur and compound during problem selection, data collection, and outcome definition, this research pertains to some generalizability impediments that occur during the development and the post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distribution of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both Extreme Gradient Boosting and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduce the double discriminant scoring of type I and show that it is the most generalizable, as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male, and female Framingham coronary heart disease data.
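Selecting a probability cutoff to turn a fitted regression model into a classifier can be sketched as a search over candidate thresholds. The abstract does not say which criterion the authors use, so the example below assumes Youden's J statistic (sensitivity + specificity − 1) as a stand-in; the function names and the toy labels and scores are hypothetical and are not Framingham data.

```python
def confusion_counts(y_true, scores, cutoff):
    """Count TP, FP, TN, FN when predicting 1 for scores >= cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def youden_j(y_true, scores, cutoff):
    """Youden's J = sensitivity + specificity - 1 at the given cutoff."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, cutoff)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0


def best_cutoff(y_true, scores, candidates):
    """Return the candidate cutoff with the highest J (first one on ties)."""
    return max(candidates, key=lambda c: youden_j(y_true, scores, c))


# Toy, unbalanced example: predicted probabilities from some fitted model.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.04, 0.11, 0.18, 0.27, 0.36, 0.42, 0.61, 0.83]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

cutoff = best_cutoff(y_true, scores, candidates)
print(f"chosen cutoff: {cutoff:.2f}")
print(f"J at chosen cutoff: {youden_j(y_true, scores, cutoff):.3f}")
print(f"J at default 0.5:   {youden_j(y_true, scores, 0.5):.3f}")
```

On unbalanced data the default cutoff of 0.5 tends to miss positives with moderate predicted probabilities, which is why a data-driven cutoff can matter. In practice the cutoff should be chosen on training or validation data and then evaluated on held-out data.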


CDRH Seeks Public Comment: Digital Health Technologies for Detecting Prediabetes and Undiagnosed Type 2 Diabetes

arXiv.org Artificial Intelligence

This document provides responses to the FDA's request for public comments (Docket No FDA 2023 N 4853) on the role of digital health technologies (DHTs) in detecting prediabetes and undiagnosed type 2 diabetes. It explores current DHT applications in prevention, detection, treatment and reversal of prediabetes, highlighting AI chatbots, online forums, wearables and mobile apps. The methods employed by DHTs to capture health signals like glucose, diet, symptoms and community insights are outlined. Key subpopulations that could benefit most from remote screening tools include rural residents, minority groups, high-risk individuals and those with limited healthcare access. Capturable high-impact risk factors encompass glycemic variability, cardiovascular parameters, respiratory health, blood biomarkers and patient-reported symptoms. An array of non-invasive monitoring tools is discussed, although further research into their accuracy for diverse groups is warranted. Extensive health datasets providing immense opportunities for AI- and ML-based risk modeling are presented. Promising techniques leveraging EHRs, imaging, wearables and surveys to enhance screening through AI and ML algorithms are showcased. Analysis of social media and streaming data further allows disease prediction across populations. Ongoing innovation focused on inclusivity and accessibility is highlighted as pivotal in unlocking DHTs' potential for transforming prediabetes and diabetes prevention and care.


SeFNet: Bridging Tabular Datasets with Semantic Feature Nets

arXiv.org Artificial Intelligence

Machine learning applications cover a wide range of predictive tasks in which tabular datasets play a significant role. However, although they often address similar problems, tabular datasets are typically treated as standalone tasks. The possibilities of using previously solved problems are limited due to the lack of structured contextual information about their features and the lack of understanding of the relations between them. To overcome this limitation, we propose a new approach called Semantic Feature Net (SeFNet), capturing the semantic meaning of the analyzed tabular features. By leveraging existing ontologies and domain knowledge, SeFNet opens up new opportunities for sharing insights between diverse predictive tasks. One such opportunity is the Dataset Ontology-based Semantic Similarity (DOSS) measure, which quantifies the similarity between datasets using relations across their features. In this paper, we present an example of SeFNet prepared for a collection of predictive tasks in healthcare, with the features' relations derived from the SNOMED-CT ontology. The proposed SeFNet framework and the accompanying DOSS measure address the issue of limited contextual information in tabular datasets. By incorporating domain knowledge and establishing semantic relations between features, we enhance the potential for meta-learning and enable valuable insights to be shared across different predictive tasks.


Tutorial: Deep Learning + OA & DCS for Heart Disease Prediction

#artificialintelligence

How likely is a person to develop a heart disease condition within the next ten years? Our following BOTX tutorial shows how to create a machine learning model using neural networks, DCS (our integrated data platform), and an online assistant to answer this very question based on 15 parameters. This blog post and video are part one of the tutorials, focusing on the machine learning model and DCS setup. Around 17.5 million people die each year from cardiovascular diseases (CVDs), an estimated 31% of all deaths worldwide. This number is expected to grow to more than 23.6 million by 2030.


Can AI Predict If Your House Is Going To Burn To The Ground?

#artificialintelligence

Standing on the outskirts of Oakland, California, Attila Toth takes in the nearby forested hills. The CEO looks out on what locals call "The Town" and, in the distance, San Francisco, or "The City." Close by, Toth sees tangles of redwood, eucalyptus and oak trees – and the wildfire risk they pose. This "wildland-urban interface" isn't far from the site of the 1991 Oakland Hills Fire, which flared up suddenly in a heavily residential area. Over four days, 3,000 homes were destroyed in one of the city's wealthiest neighborhoods, causing an estimated $1.5 billion in damages ($3.2 billion in today's dollars).


Is Machine Learning The Future Of Coffee Health Research? - AI Summary

#artificialintelligence

The stories generally go like this: "a study finds drinking coffee is associated with an X% decrease in [bad health outcome]" followed shortly by "the study is observational and does not prove causation." In a new study in the American Heart Association's journal Circulation: Heart Failure, researchers found a link between drinking three or more cups of coffee a day and a decreased risk of heart failure. Led by David Kao, a cardiologist at University of Colorado School of Medicine, researchers re-examined the Framingham Heart Study (FHS), "a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts" that began in 1948 and has grown to include over 14,000 participants. Able to analyze massive amounts of data in a short amount of time--as well as be programmed to handle uncertainties in the data, like whether a reported cup of coffee is six ounces or eight ounces--machine learning can then start to ascertain and rank which variables are most associated with incidents of heart failure, giving even observational studies more explanatory power in their findings. And indeed, when the results of the FHS machine learning analysis were compared to two other well-known studies, the Cardiovascular Health Study (CHS) and the Atherosclerosis Risk in Communities study (ARIC), the algorithm was able "to correctly predict the relationship between coffee intake and heart failure."


Build a Machine Learning Web App in 5 Minutes - KDnuggets

#artificialintelligence

The past year has seen a massive increase in the scope of data-related roles. Most aspiring data professionals tend to put a lot of focus on model building, and there is less emphasis placed on other elements of the data science lifecycle. Due to this, many data scientists are unable to work in an environment outside of a Jupyter Notebook. They are unable to get their models into the hands of an end-user, and rely on external teams to do this for them. In smaller companies that don't have a data pipeline in place, these models never see the light of day.